fix(chrome-ai): probe-gate caps + session/validation correctness (#514) #520
Merged
Chrome's `LanguageModel.create` did not universally accept `tools` or
`responseConstraint` options, yet `inferWebBrowserCapabilities` always
advertised `tool-use` + `json-mode` for `chrome-prompt`/`gemini-nano`.
This caused the dispatcher to route json-mode and tool-use tasks to the
WebBrowser provider on Chrome builds that would reject them at runtime.
Adds a one-shot capability probe (`probeWebBrowserCapabilities`) that
smoke-tests `factory.create({ responseConstraint })` and
`factory.create({ tools })`, with module-level coalescing so concurrent
callers share one probe round-trip. `WebBrowserProvider` kicks the probe
off in its constructor; until it resolves, `inferCapabilities` returns
the conservative subset (no `json-mode`, no `tool-use`). Tests cover
all four probe outcome combinations, coalescing, and pre/post-ready
inference.
https://claude.ai/code/session_013PqntVCfKgKmJ5396w7BPC
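The probe-and-coalesce behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `WebBrowser_CapabilityProbe.ts`: the `ProbeFactory`, `ProbeSession`, and `ProbedCapabilities` shapes and the function names `accepts`, `probeCapabilities`, and `gatedCapabilities` are assumptions made for the example.

```typescript
// Hypothetical minimal shapes; the real Chrome `LanguageModel` typings are richer.
interface ProbeSession {
  destroy(): void;
}
interface ProbeFactory {
  create(options: Record<string, unknown>): Promise<ProbeSession>;
}
interface ProbedCapabilities {
  jsonMode: boolean;
  toolUse: boolean;
}

// Module-level slot: every caller awaits the same in-flight (or settled) promise.
let probePromise: Promise<ProbedCapabilities> | undefined;

// Smoke-test one option set: create a session, destroy it at once, report success.
async function accepts(
  factory: ProbeFactory,
  options: Record<string, unknown>,
): Promise<boolean> {
  try {
    const session = await factory.create(options);
    session.destroy();
    return true;
  } catch {
    return false;
  }
}

function probeCapabilities(factory: ProbeFactory): Promise<ProbedCapabilities> {
  if (probePromise === undefined) {
    probePromise = (async () => {
      // Both options are tested independently of each other.
      const [jsonMode, toolUse] = await Promise.all([
        accepts(factory, { responseConstraint: { type: "object" } }),
        accepts(factory, { tools: [] }),
      ]);
      return { jsonMode, toolUse };
    })();
  }
  return probePromise;
}

// Until the probe resolves (`probed` is undefined), advertise neither capability.
function gatedCapabilities(probed?: ProbedCapabilities): string[] {
  const caps: string[] = [];
  if (probed?.jsonMode) caps.push("json-mode");
  if (probed?.toolUse) caps.push("tool-use");
  return caps;
}
```

Storing the promise rather than the result is what lets concurrent callers share one round-trip: the second caller finds the slot already filled and awaits the same promise.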
…ingerprint (H1)
The structured-generation run-fn dropped `sessionId` from its signature, so successive calls with the same id always rebuilt the underlying Chrome `LanguageModel` even though the surface supports session reuse. This matched the pre-session-cache behaviour rather than the post-cache shape adopted by `WebBrowser_Chat`.
Accept `sessionId` as the 6th positional parameter, mirroring chat. Cache reuse is gated on a canonical schema fingerprint stored on the cache entry: a schema change forces a rebuild because Chrome's `responseConstraint` state is bound at first prompt and re-feeding a different schema is undefined behaviour. On stream failure the entry is dropped + destroyed via the same `cacheWritten` / `dropChromeSessionEntry` dance as chat.
`ChromeChatSessionState` grows an optional `schemaFingerprint` field.
https://claude.ai/code/session_013PqntVCfKgKmJ5396w7BPC
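A minimal sketch of the fingerprint gate described above, assuming the fingerprint is a key-sorted JSON serialisation of the schema. The names `canonicalStringify`, `CachedSession`, and `reuseOrDrop` are hypothetical and do not come from the PR.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Stable serialisation: object keys are sorted, so key order cannot change the result.
function canonicalStringify(value: Json): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalStringify(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj: { [key: string]: Json } = value;
    const body = Object.keys(obj)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalStringify(obj[key]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Assumed cache-entry shape: a destroyable session plus the schema it was built for.
interface CachedSession {
  session: { destroy(): void };
  schemaFingerprint?: string;
}

// Return the cached entry only when the fingerprint matches; otherwise destroy
// and drop it so the caller rebuilds with the new schema.
function reuseOrDrop(
  cache: Map<string, CachedSession>,
  sessionId: string,
  schema: Json,
): CachedSession | undefined {
  const entry = cache.get(sessionId);
  if (entry === undefined) return undefined;
  if (entry.schemaFingerprint === canonicalStringify(schema)) return entry;
  entry.session.destroy();
  cache.delete(sessionId);
  return undefined;
}
```

Two schemas that differ only in key order produce the same fingerprint, so a caller that rebuilds its schema object on every turn still hits the cache.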
… (H2)
`WebBrowser_ToolCalling` ignored both `outputSchema` and `sessionId` (the 5th and 6th positional parameters of the run-fn contract), so multi-turn tool-calling rebuilt the `LanguageModel` each turn.
Accept both parameters. Cache reuse keys on a sorted-tool-name fingerprint (Chrome binds `tools` at `create()` time and can't hot-swap them per turn). We only cache when the orchestrator drives via `input.messages` because Chrome's tool-calling loop appends tool-result turns to the session's internal state opaquely; reusing a cached session across a turn the orchestrator hasn't fully replayed would double-feed those results. Bare-prompt callers always rebuild.
On any error we drop + destroy the cache entry: Chrome's internal state may be mid-tool-call-cycle. `ChromeChatSessionState` grows an optional `toolsFingerprint` field.
https://claude.ai/code/session_013PqntVCfKgKmJ5396w7BPC
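The two caching rules above (an order-insensitive tool fingerprint, and caching only for `input.messages` callers) might be expressed along these lines. The interfaces and the names `toolsFingerprint`, `mayUseSessionCache`, and `canReuse` are assumptions for illustration; the actual run-fn may serialise the names differently.

```typescript
// Hypothetical minimal shapes for the example.
interface ToolDefinition {
  name: string;
}
interface ToolCallingInput {
  prompt?: string;
  messages?: unknown[];
}

// Order-insensitive fingerprint: `map` copies the names, so sorting does not
// mutate the caller's tool list.
function toolsFingerprint(tools: readonly ToolDefinition[]): string {
  return JSON.stringify(tools.map((tool) => tool.name).sort());
}

// Cache only when a session id is supplied and the orchestrator replays
// history via `messages`; bare-prompt callers always rebuild.
function mayUseSessionCache(input: ToolCallingInput, sessionId?: string): boolean {
  return sessionId !== undefined && Array.isArray(input.messages);
}

// A cached session is reusable only if it was created with the same tool set.
function canReuse(
  cachedFingerprint: string | undefined,
  tools: readonly ToolDefinition[],
): boolean {
  return cachedFingerprint === toolsFingerprint(tools);
}
```

Serialising the sorted names with `JSON.stringify` avoids ambiguity that a plain separator join could introduce if a tool name contained the separator.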
… (H3)
Chrome's `LanguageModel` invokes our stub `execute` callback with whatever arguments the model emits. `filterValidToolCalls` only checked the tool name, so a hallucinated arg shape was forwarded to the orchestrator verbatim, leaving the downstream tool runner to either fail or silently produce garbage.
Compile each tool's `inputSchema` once via `compileSchema` (cached by name) before the stream starts. After streaming we validate every captured call's `input` against its tool's validator; failures are dropped + warn-logged in the same shape as `filterValidToolCalls`'s existing name-only warning. Tools whose `inputSchema` fails to compile emit a single warning and fall through to the name-only check rather than failing the whole run.
https://claude.ai/code/session_013PqntVCfKgKmJ5396w7BPC
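The compile-once, validate-after-stream flow described above can be illustrated with a toy validator. `compileTiny` below is a deliberately small stand-in for `compileSchema` (whose real signature is not shown in this PR), and every name in the block is hypothetical.

```typescript
interface ToolSchema {
  type: string;
  required?: string[];
  properties?: Record<string, { type: string }>;
}
interface Tool {
  name: string;
  inputSchema: ToolSchema;
}
interface ToolCall {
  name: string;
  input: unknown;
}
// Returns an error message, or undefined when the input is acceptable.
type Validator = (input: unknown) => string | undefined;

// Stand-in compiler: flat objects with string/number/boolean properties only.
function compileTiny(schema: ToolSchema): Validator {
  if (schema.type !== "object") throw new Error(`unsupported schema type: ${schema.type}`);
  return (input) => {
    if (typeof input !== "object" || input === null || Array.isArray(input)) {
      return "input is not an object";
    }
    const record = input as Record<string, unknown>;
    for (const key of schema.required ?? []) {
      if (!(key in record)) return `missing required property "${key}"`;
    }
    for (const [key, spec] of Object.entries(schema.properties ?? {})) {
      if (key in record && typeof record[key] !== spec.type) {
        return `property "${key}" is not a ${spec.type}`;
      }
    }
    return undefined;
  };
}

// Compile each tool once; a compile failure warns once and stores `undefined`,
// which later means "name-only check" for that tool.
function compileValidators(
  tools: Tool[],
  warn: (message: string) => void,
): Map<string, Validator | undefined> {
  const validators = new Map<string, Validator | undefined>();
  for (const tool of tools) {
    try {
      validators.set(tool.name, compileTiny(tool.inputSchema));
    } catch {
      warn(`schema for "${tool.name}" failed to compile; using name-only check`);
      validators.set(tool.name, undefined);
    }
  }
  return validators;
}

// Post-stream pass: drop unknown tools and calls whose input fails validation.
function filterValidCalls(
  calls: ToolCall[],
  validators: Map<string, Validator | undefined>,
  warn: (message: string) => void,
): ToolCall[] {
  return calls.filter((call) => {
    if (!validators.has(call.name)) {
      warn(`dropping call to unknown tool "${call.name}"`);
      return false;
    }
    const problem = validators.get(call.name)?.(call.input);
    if (problem !== undefined) {
      warn(`dropping call to "${call.name}": ${problem}`);
      return false;
    }
    return true;
  });
}
```

Keeping an explicit `undefined` entry for tools whose schema did not compile distinguishes "known tool, no validator" from "unknown tool", which is what allows the graceful fall-through.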
…ma (H4)
Chrome's `responseConstraint` is best-effort, not a hard guarantee — the
model can still produce a partial or shape-mismatched payload. The
existing fallback (`parsePartialJson(...) ?? {}`) handed downstream code
an empty object cast to the output type, indistinguishable from a
legitimate empty payload. Worse, that path emitted a `finish` event, so
`StructuredGenerationTask`'s retry loop had no signal to retry on.
Compile the validator once via `compileSchema`. After streaming:
- If neither `JSON.parse` nor `parsePartialJson` produces a value:
throw `PermanentJobError("Chrome AI returned unparseable JSON")`.
- If validation fails: throw with the first validator error message.
- Only on success do we emit `finish` and write the cache entry.
`StructuredGenerationTask.executeStream` catches per-attempt errors and
retries, so throwing here is the correct signal — no `finish` so the
loop knows this attempt failed. Schema compile failures are also
surfaced as `PermanentJobError` (so retries don't burn through quota on
a malformed schema).
https://claude.ai/code/session_013PqntVCfKgKmJ5396w7BPC
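The parse, validate, then finish ordering described in this commit could be sketched as below. `PermanentJobError` is redefined locally as a stand-in for the project's class, the `Validate` shape is an assumption, and the sketch omits the `parsePartialJson` fallback and the cache write.

```typescript
// Local stand-in for the project's PermanentJobError.
class PermanentJobError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PermanentJobError";
  }
}

type StreamEvent = { type: "finish"; data: unknown };

// Hypothetical validator shape: returns error messages, empty when the value conforms.
type Validate = (value: unknown) => string[];

function tryParse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

// Emits `finish` only after the payload has parsed and validated; both failure
// paths throw before any `finish` event is produced.
function finalizeStructuredOutput(
  raw: string,
  validate: Validate,
  emit: (event: StreamEvent) => void,
): unknown {
  const parsed = tryParse(raw);
  if (parsed === undefined) {
    throw new PermanentJobError("Chrome AI returned unparseable JSON");
  }
  const errors = validate(parsed);
  if (errors.length > 0) {
    throw new PermanentJobError(`Chrome AI output failed schema validation: ${errors[0]}`);
  }
  emit({ type: "finish", data: parsed });
  return parsed;
}
```

Because the throw happens before `emit`, a consumer that treats `finish` as the success signal never sees a half-valid payload.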
Pull request overview
This PR hardens the @workglow/chrome-ai provider by (1) probing Chrome Prompt API feature support before advertising json-mode / tool-use, and (2) fixing session reuse + schema validation correctness for Structured Generation and Tool Calling run functions.
Changes:
- Add a module-level capability probe (coalesced) and wire it into `WebBrowserProvider` with a `ready()` hook and conservative pre-probe capability inference.
- Fix `sessionId` handling and cache invalidation rules for `WebBrowser_StructuredGeneration` and `WebBrowser_ToolCalling`, including schema/toolset fingerprinting.
- Add schema validation for Tool Calling args (`inputSchema`) and Structured Generation final JSON (`outputSchema`), plus expand provider test coverage substantially.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| providers/chrome-ai/src/ai/WebBrowserProvider.ts | Kicks off capability probing in the constructor; exposes ready(); gates inferred capabilities using probed results. |
| providers/chrome-ai/src/ai/index.ts | Extends _testOnly exports for probe helpers and run-fns to support new tests. |
| providers/chrome-ai/src/ai/common/WebBrowser_ToolCalling.ts | Adds sessionId support, toolset fingerprinting + caching rules, and validates tool-call args against each tool’s inputSchema. |
| providers/chrome-ai/src/ai/common/WebBrowser_StructuredGeneration.ts | Adds sessionId support, schema fingerprinting + caching, and validates final JSON against outputSchema with PermanentJobError failures. |
| providers/chrome-ai/src/ai/common/WebBrowser_Sessions.ts | Extends cached session state to store schema/tool fingerprints alongside the session + message watermark. |
| providers/chrome-ai/src/ai/common/WebBrowser_CapabilityProbe.ts | New probe module that smoke-tests optional Chrome Prompt API surfaces and caches the result. |
| providers/chrome-ai/src/ai/common/WebBrowser_Capabilities.ts | Updates capability inference to conditionally include json-mode / tool-use; adds async inference helper. |
| packages/test/src/test/ai-provider/WebBrowserProvider.test.ts | Adds extensive tests covering probe behavior/coalescing, caching correctness, and schema validation behaviors. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
sroussey added a commit that referenced this pull request on May 22, 2026
…apability probe
Integrates the chrome-ai branch (7 commits, PR #514/#520/#528) with main's parallel chrome-ai work (model.download, model.dispose, ApiBinding):
- Chat-session cache keyed by AiChatTask sessionId, with messageCount high-water mark for reuse (replaces fingerprint-based invalidation)
- StructuredGeneration + ToolCalling run-fns gated by an async capability probe; pre-probe state advertises a conservative subset (no json-mode, no tool-use) so the provider never claims a capability it can't fulfil
- ChatHistory helpers + WebBrowser_TextGeneration_Unified dispatcher (text.generation shared by AiChatTask + TextGenerationTask)
- ChromeHelpers ships both assertAvailability and ensureAvailable; both session APIs (chrome-chat cache + idle-evict store) coexist
- Drops main's WebBrowser_Chat.test.ts (chrome-ai's WebBrowserProvider.test already covers chat behavior under the new cache semantics)
sroussey added a commit that referenced this pull request on May 22, 2026
…viders
Addresses review of #514/#520/#528 rebase:
CRITICAL fix: `model.dispose` now reaches chat-cached sessions. The post-rebase chrome-ai branch had two parallel session maps (`chromeSessions` for chat reuse, `sessions` for idle-evict + ModelDispose lookup) but only the chat map was populated by runtime code, making `model.dispose` a functional no-op in production. Unified into a single Map<sessionId, WebBrowserSessionEntry> with both chat-cache fields (messageCount, fingerprints) and lifecycle fields (modelKey, lastUsedAt, idleTimer). `ChromeChatSessionState` now requires `modelKey`. `disposeWebBrowserSessionsForModel(modelKey)` iterates the unified store, so model.dispose destroys chat-cached sessions. Chat sessions become subject to idle eviction (free bonus).
IMPORTANT: sanitizeToolArgs applied across the codebase per intent of the prior refactor:
- OpenAIShapedChat (parseOpenAIToolCallMessage + accumulateOpenAIStream) → covers OpenAI + HFI
- ToolCallParsers (adaptParserResult + parseToolCallsFromText) → covers llama.cpp Hermes/Liquid/Qwen35/Llama paths + HFT
- Anthropic_ToolCalling (input_json_delta + content_block_stop)
- Gemini_ToolCalling (functionCall.args)
- Ollama_ToolCalling (parsed function.arguments)
- LlamaCpp_ToolCalling (extractNativeFunctionCalls)
- Cactus_ToolCalling[.browser] (JSON-parse parseToolCalls paths)
Every model-supplied tool-arg payload now passes through sanitizeToolArgs before reaching downstream consumers, closing the prototype-pollution vector across the provider matrix.
Also:
- Added packages/test/src/test/ai/ToolCallingUtils.test.ts (14 unit tests for sanitizeToolArgs, compileToolValidators, validateToolCallArgs, plus a sanitize→validate→name-check integration test).
- Added WebBrowser_Sessions.test regression for the unified-store behavior (disposeWebBrowserSessionsForModel sees chat-cached entries).
- Documented WebBrowser_Chat's rebuild-on-next-turn recovery model (vs the in-fn retry that main's now-deleted test exercised).
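To make the prototype-pollution defence mentioned above concrete, here is a rough sketch of recursive key scrubbing. It is not the project's `sanitizeToolArgs`; the function name `scrubToolArgs` and the exact blocked-key set are assumptions based on the commit text.

```typescript
// Keys that can reach or replace an object's prototype when merged carelessly.
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Recursively copy `value`, skipping blocked keys at every depth.
// Primitives pass through unchanged; arrays are scrubbed element by element.
function scrubToolArgs(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => scrubToolArgs(item));
  if (value === null || typeof value !== "object") return value;
  const clean: Record<string, unknown> = {};
  for (const [key, child] of Object.entries(value)) {
    if (BLOCKED_KEYS.has(key)) continue;
    clean[key] = scrubToolArgs(child);
  }
  return clean;
}
```

`JSON.parse` creates `__proto__` as an ordinary own property, so a payload such as `{"__proto__": {...}}` is harmless until something merges it into another object; copying into a fresh object while skipping those keys removes the hazard before downstream code sees the arguments.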
sroussey added a commit that referenced this pull request on May 22, 2026
* feat(chrome-ai): chat history, tool calling, structured generation, capability probe
  Integrates the chrome-ai branch (7 commits, PR #514/#520/#528) with main's parallel chrome-ai work (model.download, model.dispose, ApiBinding):
  - Chat-session cache keyed by AiChatTask sessionId, with messageCount high-water mark for reuse (replaces fingerprint-based invalidation)
  - StructuredGeneration + ToolCalling run-fns gated by an async capability probe; pre-probe state advertises a conservative subset (no json-mode, no tool-use) so the provider never claims a capability it can't fulfil
  - ChatHistory helpers + WebBrowser_TextGeneration_Unified dispatcher (text.generation shared by AiChatTask + TextGenerationTask)
  - ChromeHelpers ships both assertAvailability and ensureAvailable; both session APIs (chrome-chat cache + idle-evict store) coexist
  - Drops main's WebBrowser_Chat.test.ts (chrome-ai's WebBrowserProvider.test already covers chat behavior under the new cache semantics)
* refactor(ai,chrome-ai,openai,hfi): shared tool sanitation; emit-pattern streams
  Tool calling utilities (packages/ai/src/task/ToolCallingUtils.ts):
  - sanitizeToolArgs: recursive __proto__/constructor/prototype scrubbing for model-supplied tool args (prototype-pollution defence)
  - compileToolValidators + validateToolCallArgs: per-tool inputSchema validation with graceful fallback for tools whose schema fails to compile
  Stream helpers converted from generators to emit-callback so run-fns no longer need a for-await/yield pump:
  - snapshotStreamToTextDeltas / snapshotStreamToSnapshots (chrome-ai)
  - accumulateOpenAIStream (@workglow/ai provider-utils, used by OpenAI + HFI)
  Run-fns updated to call helpers with emit directly and emit their own final 'finish' event. chrome-ai's WebBrowser_ToolCalling drops its private sanitization + validation copy and reuses the shared utils.
* fix(chrome-ai): wire model.dispose; apply sanitizeToolArgs across providers
  Addresses review of #514/#520/#528 rebase:
  CRITICAL fix: `model.dispose` now reaches chat-cached sessions. The post-rebase chrome-ai branch had two parallel session maps (`chromeSessions` for chat reuse, `sessions` for idle-evict + ModelDispose lookup) but only the chat map was populated by runtime code, making `model.dispose` a functional no-op in production. Unified into a single Map<sessionId, WebBrowserSessionEntry> with both chat-cache fields (messageCount, fingerprints) and lifecycle fields (modelKey, lastUsedAt, idleTimer). `ChromeChatSessionState` now requires `modelKey`. `disposeWebBrowserSessionsForModel(modelKey)` iterates the unified store, so model.dispose destroys chat-cached sessions. Chat sessions become subject to idle eviction (free bonus).
  IMPORTANT: sanitizeToolArgs applied across the codebase per intent of the prior refactor:
  - OpenAIShapedChat (parseOpenAIToolCallMessage + accumulateOpenAIStream) → covers OpenAI + HFI
  - ToolCallParsers (adaptParserResult + parseToolCallsFromText) → covers llama.cpp Hermes/Liquid/Qwen35/Llama paths + HFT
  - Anthropic_ToolCalling (input_json_delta + content_block_stop)
  - Gemini_ToolCalling (functionCall.args)
  - Ollama_ToolCalling (parsed function.arguments)
  - LlamaCpp_ToolCalling (extractNativeFunctionCalls)
  - Cactus_ToolCalling[.browser] (JSON-parse parseToolCalls paths)
  Every model-supplied tool-arg payload now passes through sanitizeToolArgs before reaching downstream consumers, closing the prototype-pollution vector across the provider matrix.
  Also:
  - Added packages/test/src/test/ai/ToolCallingUtils.test.ts (14 unit tests for sanitizeToolArgs, compileToolValidators, validateToolCallArgs, plus a sanitize→validate→name-check integration test).
  - Added WebBrowser_Sessions.test regression for the unified-store behavior (disposeWebBrowserSessionsForModel sees chat-cached entries).
  - Documented WebBrowser_Chat's rebuild-on-next-turn recovery model (vs the in-fn retry that main's now-deleted test exercised).
* feat(chrome-ai): retry once on InvalidStateError when a cached session is destroyed
  Chrome can destroy a `LanguageModel` session out from under us (tab backgrounding, GPU process restart, memory pressure). When a cached session's `promptStreaming` throws DOMException("...destroyed...", "InvalidStateError") we now rebuild the session from full history via `initialPrompts` and retry the prompt once.
  Retry is gated on three conditions, all required:
  - We were using a CACHED session (a fresh-session failure means the model is broken; retrying won't help).
  - No text-delta has reached the consumer yet (we can't unsend deltas).
  - The error name is `InvalidStateError` (matches Chrome's InvalidStateError DOMException; tolerant of message-text changes).
  Tests:
  - "retries once with a fresh session when a cached session is destroyed" seeds the cache on turn 1, has the cached session's promptStreaming throw on turn 2's reuse, asserts rebuild + retry + cache replacement.
  - "does not retry when a fresh (non-cached) session fails" guards the first gate.
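The three retry gates listed in the last commit can be illustrated with a small helper. This is a sketch under assumptions: `StreamingSession`, `isInvalidState`, and `streamWithRetry` are hypothetical names, and rebuilding from full history via `initialPrompts` is abstracted into a caller-supplied `rebuild` callback.

```typescript
// Hypothetical minimal session shape for the example.
interface StreamingSession {
  promptStreaming(prompt: string): AsyncIterable<string>;
}

// Match on the error name only, so changes to the message text do not matter.
function isInvalidState(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { name?: unknown }).name === "InvalidStateError"
  );
}

// Streams deltas to `emit`; retries once on a rebuilt session only if all three
// gates hold: a cached session was used, nothing was emitted, and the error is
// an InvalidStateError. Returns the session that completed the prompt.
async function streamWithRetry(
  cached: StreamingSession | undefined,
  rebuild: () => Promise<StreamingSession>,
  prompt: string,
  emit: (delta: string) => void,
): Promise<StreamingSession> {
  const usedCache = cached !== undefined;
  const first = cached ?? (await rebuild());
  let emitted = false;
  try {
    for await (const delta of first.promptStreaming(prompt)) {
      emitted = true;
      emit(delta);
    }
    return first;
  } catch (error) {
    if (!usedCache || emitted || !isInvalidState(error)) throw error;
  }
  const fresh = await rebuild();
  for await (const delta of fresh.promptStreaming(prompt)) emit(delta);
  return fresh;
}
```

Returning the session that actually served the prompt lets the caller replace the cache entry after a successful retry.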
Stacks fixes on top of #514's `chrome-ai` branch. Five focused commits, one per issue.
Summary
C1: Probe-gate `tool-use` and `json-mode`
`inferWebBrowserCapabilities` unconditionally advertised `json-mode` + `tool-use` for `chrome-prompt`/`gemini-nano`, but `LanguageModel.create`'s `tools` and `LanguageModel.prompt`'s `responseConstraint` aren't universally accepted across Chrome builds. The dispatcher could route a json-mode/tool-use task to a provider that would reject it at runtime.
- `providers/chrome-ai/src/ai/common/WebBrowser_CapabilityProbe.ts`: one-shot probe with module-level promise coalescing; smoke-tests both options independently and immediately destroys the test sessions.
- `WebBrowserProvider` constructor kicks off the probe and stores the result on `this.probedCaps`. Provider exposes `ready(): Promise<void>`.
- Pre-probe: conservative subset (no `json-mode`, no `tool-use`). Post-probe: reflects browser surface.
- `inferWebBrowserCapabilities(model, probed?)` defaults to `{jsonMode: true, toolUse: true}` for back-compat; new `inferWebBrowserCapabilitiesAsync` drives the probe.
- `providers/chrome-ai/src/ai/common/WebBrowser_Capabilities.ts`, `providers/chrome-ai/src/ai/WebBrowserProvider.ts`, `providers/chrome-ai/src/ai/index.ts`.
H1: `WebBrowser_StructuredGeneration` accepts `sessionId`
The run-fn dropped `sessionId` from its signature, so successive calls with the same id always rebuilt the underlying `LanguageModel`.
- Accept `sessionId` as the 6th positional param.
- Cache reuse is gated on a canonical schema fingerprint stored on `ChromeChatSessionState.schemaFingerprint`.
- A schema change forces a rebuild (Chrome's `responseConstraint` state is bound at first prompt).
- On stream failure the entry is dropped + destroyed via the same `cacheWritten` / `dropChromeSessionEntry` dance as `WebBrowser_Chat`.
- `providers/chrome-ai/src/ai/common/WebBrowser_StructuredGeneration.ts` (line ~41 signature), `providers/chrome-ai/src/ai/common/WebBrowser_Sessions.ts` (extended `ChromeChatSessionState`).
H2: `WebBrowser_ToolCalling` accepts `sessionId`
Ignored both `outputSchema` and `sessionId`.
- Cache reuse keys on a sorted-tool-name fingerprint stored on `ChromeChatSessionState.toolsFingerprint`. Tool-set change rebuilds.
- Only caches when `input.messages` is present. Bare-prompt callers always rebuild because Chrome appends tool-result turns to the session's internal state opaquely; reusing a cache the orchestrator hasn't fully replayed would double-feed results. Documented in code.
- `providers/chrome-ai/src/ai/common/WebBrowser_ToolCalling.ts` (line ~104 signature).
H3: Validate tool-call arguments against `inputSchema`
`callInput = (args[0] ?? {})` was forwarded verbatim; `filterValidToolCalls` only checked the tool name.
- Compile each tool's `inputSchema` once via `compileSchema` from `@workglow/util/schema`, cached by name.
- … `filterValidToolCalls`. Invalid → drop + `getLogger().warn(...)` matching the existing name-only warning style.
- Tools whose `inputSchema` fails to compile log once and fall through to name-only validation (no run-level crash).
- `providers/chrome-ai/src/ai/common/WebBrowser_ToolCalling.ts` (line ~128 execute stub + new validator pass).
H4: Validate StructuredGeneration final JSON against `outputSchema`
When `JSON.parse` failed AND `parsePartialJson` returned `undefined`, the run-fn cast `{}` to the output type, emitted a `finish` event, and downstream code had no way to distinguish that from a legitimate empty payload.
- `compileSchema` at the top of the run; compile failure → `PermanentJobError("invalid outputSchema")` (avoids burning retry budget on a malformed schema).
- Neither parser produces a value → `PermanentJobError("Chrome AI returned unparseable JSON")`. No `finish` emitted.
- Validation fails → `PermanentJobError("Chrome AI output failed schema validation: ...")`. No `finish` emitted.
- Only on success is `finish` emitted and the cache entry written.
- `providers/chrome-ai/src/ai/common/WebBrowser_StructuredGeneration.ts` (lines ~94, ~96 of the original; reworked).
`StructuredGenerationTask.executeStream` (in `packages/ai`) wraps `super.executeStream(currentInput, context)` in a per-attempt for-await and validates per finish, so a thrown error correctly fails the attempt and the loop retries up to `maxRetries`. Throwing without emitting `finish` is the right shape.
Test plan
- `bun test packages/test/src/test/ai-provider/WebBrowserProvider.test.ts`: 46 tests pass (up from 19).
- `tsgo --noEmit` clean on `providers/chrome-ai/` and `packages/test/`.
- `bunx vitest run packages/test/src/test/ai-provider`: only failures are unrelated (HFT bbox unit test, llamacpp model download race) and reproduce on the base branch.
- … `tools`/`responseConstraint` (probe should gate them out).
Open questions
- The probe smoke-tests `create({ responseConstraint })` and `create({ tools })`. Per spec, `responseConstraint` actually lives on `prompt()` options, not `create()`: a build that silently accepts unknown create options could give us a false positive. The user brief explicitly asked for the create-time test for both, and, per the Chromium types, `tools` is a create-time option while `responseConstraint` is per-prompt. If we want a tighter signal we could additionally run a short `promptStreaming` with the constraint and read one chunk. Worth a follow-up.
- … `StructuredGenerationTask.executeStream` catches per-attempt errors from the run-fn? Inspected the task and confirmed it iterates per-attempt and validates on finish, so throwing without `finish` should retry. Could not run against a real failing model; please verify with a live Chrome AI smoke test.
- Schema and tool fingerprints are stored on `ChromeChatSessionState`. They're string-typed and unbounded; for very large schemas the canonical-stringify cost is non-trivial. If we see a hot path, hash to a fixed-length digest.
- `@workglow/util/worker` deliberately excludes `compileSchema` (json-schema-library + URI.js + nearley + json-pointer is heavyweight). H3/H4 import from `@workglow/util/schema`, which pulls those in. `bun build --packages=external` keeps them external so worker startup cost grows only if the consumer actually imports. Worth confirming the worker bundle size delta is acceptable, or guarding the validation behind a no-op fallback when running in the worker entry.
Generated by Claude Code
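On the open question about unbounded fingerprint strings, one possible fixed-length digest is sketched below. The choice of 32-bit FNV-1a is an assumption made here for brevity, not something the PR proposes, and `fnv1a32` is a hypothetical helper name. A 32-bit digest can collide, so a wider hash may be preferable if a false cache hit would be costly.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units (not its UTF-8 bytes),
// returned as 8 lowercase hex characters.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wrapping multiply
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}
```

The stored fingerprint would then be `fnv1a32(canonicalString)` instead of the canonical string itself, which bounds the memory held per cache entry; the canonical-stringify cost is still paid once per call to compute the input.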